Let's look at a different dataset that will allow us to really dive into some meaningful visualizations. This dataset is publicly available, but it is also part of a Kaggle competition.
You can get the data from here: https://www.kaggle.com/c/titanic-gettingStarted or you can use the code below to load the data from GitHub.
There are lots of IPython notebooks for exploring the Titanic data. Check them out and see if you like any better than this one!
When going through visualization options, I recommend the following steps:
Look at various high level plotting libraries like:
conda install -c conda-forge missingno
conda install nodejs
jupyter labextension install @jupyterlab/plotly-extension
# load the Titanic dataset
import pandas as pd
import numpy as np
print('Pandas:', pd.__version__)
print('Numpy:',np.__version__)
df = pd.read_csv('https://raw.githubusercontent.com/eclarson/DataMiningNotebooks/master/data/titanic.csv') # read in the csv file
df.head()
# note that describe() defaults to summarizing only the numeric variables
df.describe()
print(df.dtypes)
print('===========')
print(df.info())
# the percentage of individuals that survived on the titanic
sum(df.Survived==1)/len(df)*100.0
# Let's aggregate by class and compute survival rates
df_grouped = df.groupby(by='Pclass')
for val, grp in df_grouped:
    print('There were', len(grp), 'people traveling in', val, 'class.')
# an example of using the groupby function with a data column
print(df_grouped['Survived'].sum())
print('---------------------------------------')
print(df_grouped.Survived.count())
print('---------------------------------------')
print(df_grouped.Survived.sum() / df_grouped.Survived.count())
# might there be a better way of displaying this data?
# Class Exercise: Create code for calculating the std error
# std / sqrt(N)
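One hedged sketch of the exercise above: the per-class standard error is `std / sqrt(N)`, computable straight from the grouped frame. The frame below is a tiny made-up stand-in, since the notebook's `df` is loaded from a URL:

```python
import numpy as np
import pandas as pd

# toy stand-in for the Titanic frame (made-up values, illustration only)
toy = pd.DataFrame({'Pclass':   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]})

grp = toy.groupby('Pclass')['Survived']
rate = grp.mean()                           # survival rate, same as sum()/count()
stderr = grp.std() / np.sqrt(grp.count())   # std / sqrt(N)
print(pd.DataFrame({'rate': rate, 'stderr': stderr}))
```

Note that `grp.mean()` already gives the sum-over-count ratio computed cell by cell above.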
Let's start by visualizing the missing data in this dataset. We will use a visualization library called missingno, which has many types of visuals for locating NaNs in a dataframe and deciding how we might fill them in. I particularly like the matrix visualization, but there are many more to explore:
# this Python magic allows plots to be embedded in the notebook
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import missingno as mn
mn.matrix(df.sort_values(by=["Cabin","Embarked","Age",]))
# let's clean the dataset a little before moving on
if False: # skip this for now, just archiving it
    # 1. Remove attributes that just aren't useful for us
    for col in ['PassengerId','Name','Cabin','Ticket']:
        if col in df:
            del df[col]
    # 2. Impute some missing values, grouped by Pclass, SibSp, and Sex
    df_grouped = df.groupby(by=['Pclass','SibSp','Sex'])
    # 3. Use this grouping to fill each group with the median of that group
    func = lambda grp: grp.fillna(grp.median())
    df_imputed = df_grouped.transform(func)
    # 4. Restore any columns the median operation dropped
    col_deleted = list(set(df.columns) - set(df_imputed.columns))
    df_imputed[col_deleted] = df[col_deleted]
    # 5. Drop rows that still have missing values after grouped imputation
    df_imputed.dropna(inplace=True)
    # 6. Rearrange the columns
    df_imputed = df_imputed[['Survived','Age','Sex','Parch','SibSp','Pclass','Fare','Embarked']]
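Since the block above is archived behind `if False:`, here is a minimal runnable sketch of the same grouped-median imputation idea on a toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# toy frame with one missing Age in each Pclass group (illustrative values only)
toy = pd.DataFrame({'Pclass': [1, 1, 1, 2, 2, 2],
                    'Age':    [20.0, 40.0, np.nan, 30.0, np.nan, 50.0]})

# fill each group's NaNs with that group's own median
toy['Age'] = toy.groupby('Pclass')['Age'].transform(lambda g: g.fillna(g.median()))
print(toy)
```

The `transform` call returns a series aligned to the original index, so the fill happens within each group without reshaping the frame.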
# let's clean the dataset a little before moving on
# 1. Remove attributes that just aren't useful for us
for col in ['PassengerId','Name','Cabin','Ticket']:
    if col in df:
        del df[col]
df.info()
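As an aside, the same column removal can be written with `DataFrame.drop`, which accepts `columns=` and `errors='ignore'` so that columns absent from the frame don't raise. A small sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'PassengerId': [1, 2], 'Name': ['a', 'b'], 'Age': [22, 38]})
# errors='ignore' silently skips columns that are not present ('Cabin', 'Ticket' here)
toy = toy.drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'], errors='ignore')
print(list(toy.columns))
```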
# impute based upon the K closest samples (rows)
from sklearn.impute import KNNImputer
import copy
# get object for imputation
knn_obj = KNNImputer(n_neighbors=5)
# create a numpy matrix from pandas numeric values to impute
temp = df[['Pclass','Age','SibSp','Parch','Fare']].to_numpy()
# use sklearn imputation object
knn_obj.fit(temp)
temp_imputed = knn_obj.transform(temp)
# could have also done:
# temp_imputed = knn_obj.fit_transform(temp)
# this is VERY IMPORTANT, make a deep copy, not just a reference to the object
df_imputed = copy.deepcopy(df) # not just an alias
df_imputed[['Pclass','Age','SibSp','Parch','Fare']] = temp_imputed
df_imputed.info()
# let's show some very basic plotting to be sure the data looks about the same
df_imputed.Age.plot(kind='hist',alpha=0.5)
df.Age.plot(kind='hist', alpha=0.5)
plt.show()
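A quick self-contained sanity check of `KNNImputer` on a toy matrix (values made up) shows how a missing entry is filled from the nearest row:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[ 1.0,  2.0],
              [ 1.0, np.nan],
              [10.0, 20.0]])

# with n_neighbors=1, the NaN in row 1 is filled from its closest row
# (row 0, since distances are computed on the non-missing features)
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)
print(X_filled)
```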
[back to slides]
This is an example of how to convert a continuous feature into an ordinal feature. Let's add some human intuition to a variable by grouping the data by age.
Question: Does age range influence survival rates?
# let's break up the age variable
df_imputed['age_range'] = pd.cut(df_imputed['Age'], [0, 15, 25, 65, 1e6],
                                 labels=['child','young adult','adult','senior']) # this creates a new variable
df_imputed.age_range.describe()
# now let's group by the new variable
df_grouped = df_imputed.groupby(by=['Pclass','age_range'])
print("Percentage of survivors in each group:")
print(df_grouped.Survived.sum() / df_grouped.Survived.count() * 100)
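One nicer way to display these grouped percentages is to `unstack` one level of the index into columns, giving a compact class-by-age table. A sketch on a toy frame (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({'Pclass':    [1, 1, 2, 2],
                    'age_range': ['child', 'adult', 'child', 'adult'],
                    'Survived':  [1, 0, 1, 1]})

rate = toy.groupby(['Pclass', 'age_range'])['Survived'].mean() * 100
print(rate.unstack())  # Pclass as rows, age_range as columns
```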
# this Python magic allows plots to be embedded in the notebook
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
print('Matplotlib:', matplotlib.__version__)
# could also say "%matplotlib notebook" here to make things interactive
Pandas has plenty of plotting abilities built in. Let's take a look at a few of the different graphing capabilities of Pandas with only matplotlib. Afterward, we can make the visualizations more beautiful.
Kernel Density Estimation
KDE Example:

Question: What were the ages of people on the Titanic?
#### Plot Type Two: Histogram and Kernel Density
# Start by just plotting what we previously grouped!
plt.style.use('ggplot')
fig = plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
df_imputed.Age.plot.hist(bins=20)
plt.subplot(1,3,2)
df_imputed.Age.plot.kde(bw_method=0.2)
plt.subplot(1,3,3)
df_imputed.Age.plot.hist(bins=20)
df_imputed.Age.plot.kde(bw_method=0.1, secondary_y=True)
plt.ylim([0, 0.06])
plt.show()
Estimate the joint distribution of the values of two attributes
Question: How does age relate to the fare that was paid?
plt.hist2d(x=df_imputed.Age, y=df_imputed.Fare, bins=30)
plt.colorbar()
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()
The above plot is not all that meaningful. We can probably do better than visualizing the joint distribution with 2D histograms; let's face it, 2D histograms are bound to be sparse and not very descriptive. Instead, let's do something smarter.
First, let's visualize the correlation between the different features.
#### Plot Type Three: Heatmap (of correlation)
# plot the correlation matrix
vars_to_use = ['Survived', 'Age', 'Parch', 'SibSp', 'Pclass', 'Fare'] # pick vars
plt.pcolor(df_imputed[vars_to_use].corr()) # do the feature correlation plot
# fill in the indices
plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.colorbar()
plt.show()
Use this when you have multiple categorical or nominal variables that you want to show together in sub-groups. Grouping here means displaying the counts of the different subgroups in the dataset. For the Titanic data, this can be quite telling.
Question: Does age, gender, or class have an effect on survival?
# first group the data
df_grouped = df_imputed.groupby(by=['Pclass','age_range'])
# tabulate survival rates of each group
survival_rate = df_grouped.Survived.sum() / df_grouped.Survived.count()
# show in a bar chart using builtin pandas API
ax = survival_rate.plot(kind='barh')
plt.title('Survival Percentages by Class and Age Range')
plt.show()
# the cross tab operator provides an easy way to get these numbers
survival = pd.crosstab([df_imputed['Pclass'],
                        df_imputed['age_range']],       # categories to cross tabulate
                       df_imputed.Survived.astype(bool)) # how to group
print(survival)
survival.plot(kind='bar', stacked=True)
plt.show()
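`pd.crosstab` can also return proportions directly through its `normalize` argument (`'index'` normalizes within each row), which gets at survival rates without a separate division. A sketch with toy data:

```python
import pandas as pd

# made-up class and survival labels for illustration
pclass   = pd.Series([1, 1, 1, 3, 3, 3, 3])
survived = pd.Series([1, 1, 0, 0, 0, 0, 1]).astype(bool)

counts = pd.crosstab(pclass, survived)                     # raw counts
rates = pd.crosstab(pclass, survived, normalize='index')   # row-wise proportions
print(counts)
print(rates)
```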
# plot overall cross tab with both groups
plt.figure(figsize=(15,3))
ax1 = plt.subplot(1,3,1)
ax2 = plt.subplot(1,3,2)
ax3 = plt.subplot(1,3,3)
pd.crosstab([df_imputed['Pclass']],     # categories to cross tabulate
            df_imputed.Survived.astype(bool)).plot(kind='bar', stacked=True, ax=ax1)
pd.crosstab([df_imputed['age_range']],  # categories to cross tabulate
            df_imputed.Survived.astype(bool)).plot(kind='bar', stacked=True, ax=ax2)
pd.crosstab([df_imputed['Sex']],        # categories to cross tabulate
            df_imputed.Survived.astype(bool)).plot(kind='bar', stacked=True, ax=ax3)
plt.show()
ax = df_imputed.boxplot(column='Fare', by = 'Pclass') # group by class
plt.ylabel('Fare')
plt.title('')
ax.set_yscale('log') # so that the boxplots are not squished
The problem with boxplots is that they can hide important aspects of the distribution. For example, the plot below shows several datasets that all have the exact same boxplot.

Using pandas and matplotlib is great until you need to redo plots or make more intricate ones. Let's look at one or two APIs that might simplify our lives. First, let's use Seaborn.
In seaborn, we have access to a number of different plotting tools. Let's take a look at:
import seaborn as sns
# cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
print('Seaborn:', sns.__version__)
# now try plotting some of the previous plots, way more visually appealing!!
# sns boxplot
plt.subplots(figsize=(20, 5))
plt.subplot(1,3,1)
sns.boxplot(x="Sex", y="Age", hue="Survived", data=df_imputed)
plt.title('Boxplot Example')
plt.subplot(1,3,2)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=df_imputed)
plt.title('Violin Example')
plt.subplot(1,3,3)
sns.swarmplot(x="Sex", y="Age", hue="Survived", data=df_imputed)
plt.title('Swarm Example')
plt.show()
# ASIDE: UGH, so much repeated code, can we do "better"?
plt.subplots(figsize=(20, 5))
args = {'x': "Sex", 'y': "Age", 'hue': "Survived", 'data': df_imputed}
for i, plot_func in enumerate([sns.boxplot, sns.violinplot, sns.swarmplot]):
    plt.subplot(1, 3, i+1)
    plot_func(**args) # more compact, LESS readable
plt.show()
sns.violinplot(x="Sex", y="Age", hue="Survived", data=df_imputed,
               split=True, inner="quart")
plt.show()
Two versions:
Question: Which features are most similar to each other?
# the correlation plot is feature-based because we get
# a place in the plot for each feature
# in this plot we are asking: which features are most correlated?
sns.set(style="darkgrid") # one of the many styles to plot with
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
# NOTE: sns.set() returns None, so it cannot be used to define cmap
f, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(df_imputed.corr(), cmap=cmap, annot=True)
f.tight_layout()
New Question: Which passengers are most similar to one another?
# but we could also be asking, what instances are most similar to each other?
# NOTE: Correlation here is defined as a distance metric by scipy
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.correlation.html
# it is defined as 1-CC, so '0' means highly correlated
from sklearn.metrics.pairwise import pairwise_distances
vars_to_use = [ 'Age', 'Pclass', 'Fare', 'SibSp','Parch'] # pick vars
xdata = pairwise_distances(df_imputed[vars_to_use].values, # get numpy matrix
metric='correlation')
sns.heatmap(xdata, cmap=cmap, annot=False)
print('What is wrong with this plot?')
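A quick numeric check of that definition: scipy's correlation distance is 1 minus the Pearson correlation coefficient, so perfectly correlated vectors are at distance 0 and perfectly anti-correlated vectors at distance 2:

```python
import numpy as np
from scipy.spatial.distance import correlation

x = np.array([1.0, 2.0, 3.0, 4.0])
# a linear transform of x is perfectly correlated with x -> distance ~0
print(correlation(x, 2 * x + 1))
# the negation of x is perfectly anti-correlated -> distance ~2
print(correlation(x, -x))
```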
# let's fix a few things
# first, the difference between each instance was large,
# dominated by the biggest variable, Fare
from sklearn.preprocessing import StandardScaler
# let's scale the data to zero mean, unit variance
std = StandardScaler()
xdata = pairwise_distances(std.fit_transform(df_imputed[vars_to_use].values),
                           metric='correlation')
sns.heatmap(xdata, cmap=cmap, annot=False)
print('Is there still something wrong?')
f, ax = plt.subplots(figsize=(8, 7))
# let's scale the data to zero mean, unit variance
std = StandardScaler()
# and let's also sort the data
df_imputed_copy = df_imputed.copy().sort_values(by=['Pclass','Age','Survived'])
xdata = pairwise_distances(std.fit_transform(df_imputed_copy[vars_to_use].values),
                           metric='correlation')
sns.heatmap(xdata, cmap=cmap, annot=False)
print('Is there anything we can conclude?')
# can we make a better combined histogram and KDE?
sns.histplot(df_imputed.Age, kde=True) # distplot is deprecated in newer seaborn
plt.show()
# let's make a pretty plot of the scatter matrix
df_imputed_jitter = df_imputed.copy()
# add a little jitter to the discrete columns so points don't overlap exactly
df_imputed_jitter[['Parch','SibSp','Pclass']] += np.random.rand(len(df_imputed_jitter), 3)/2
sns.pairplot(df_imputed_jitter, hue="Survived", height=2, # 'size' was renamed 'height'
             plot_kws=dict(s=20, alpha=0.15, linewidth=0))
plt.show()
The best plots you can make are probably ones that are completely custom to the task or question you are trying to solve or answer. These plots are also the most difficult to get right, because they take a great deal of iteration, time, and effort to perfect, and they take some time to explain. There is a delicate balance between creating a new plot that answers exactly what you are asking (in the best way possible) and spending an inordinate amount of time on a new plot (when a standard plot might be a "pretty good" answer).

More updates are coming to this section of the notebook. Plotly is a major step in the direction of using JavaScript and Python together, and I would argue it has a much better implementation than other packages.
# directly from the getting started example...
import plotly
print('Plotly:', plotly.__version__)
plotly.offline.init_notebook_mode() # run at the start of every notebook
plotly.offline.iplot({
    "data": [{
        "x": [1, 2, 3],
        "y": [4, 2, 5]
    }],
    "layout": {
        "title": "hello world"
    }
})
from plotly.graph_objs import Scatter, Layout
from plotly.graph_objs.scatter import Marker
from plotly.graph_objs.layout import XAxis, YAxis
# let's manipulate the example to serve our purposes
# plotly allows us to create JS graph elements, like a scatter object
plotly.offline.iplot({
    'data': [
        Scatter(x=df_imputed.SibSp.values + np.random.rand(*df_imputed.SibSp.shape)/2,
                y=df_imputed.Age,
                text=df_imputed.Survived.values.astype(str),
                marker=Marker(size=df_imputed.Fare, sizemode='area', sizeref=1),
                mode='markers')
    ],
    'layout': Layout(xaxis=XAxis(title='Siblings and Spouses'),
                     yaxis=YAxis(title='Age'),
                     title='Age and Family Size (Marker Size == Fare)')
}, show_link=False)
Visualizing more than three attributes requires a good deal of thought. In the following graph, let's use interactivity to help bolster the analysis: we will create a graph with custom text overlays that identify the passenger we are looking at.
def get_text(df_row):
    return 'Age: %d<br>Class: %d<br>Fare: %.2f<br>SibSpouse: %d<br>ParChildren: %d' % (
        df_row.Age, df_row.Pclass, df_row.Fare, df_row.SibSp, df_row.Parch)
df_imputed['text'] = df_imputed.apply(get_text,axis=1)
textstring = ['Perished','Survived', ]
plotly.offline.iplot({
    'data': [ # creates a list using a comprehension
        Scatter(x=df_imputed.Pclass[df_imputed.Survived==val].values
                  + np.random.rand(*df_imputed.SibSp[df_imputed.Survived==val].shape)/2,
                y=df_imputed.Age[df_imputed.Survived==val],
                text=df_imputed.text[df_imputed.Survived==val].values.astype(str),
                marker=Marker(size=df_imputed[df_imputed.Survived==val].SibSp,
                              sizemode='area', sizeref=0.01),
                mode='markers',
                name=textstring[val]) for val in [0, 1]
    ],
    'layout': Layout(xaxis=XAxis(title='Social Class'),
                     yaxis=YAxis(title='Age'),
                     title='Age and Class Scatter Plot, Size = number of siblings and spouses'),
}, show_link=False)
Check more about using plotly here:
In this notebook you learned:
Todo: create and use some Bokeh examples here